Unit 6 Clustering
This week covers clustering, introducing both hierarchical clustering and K-means.
Course link:
https://www.edx.org/course/the-analytics-edge
setwd("E:\\The Analytics Edge\\Unit 6 Clustering")
The data this time is not in CSV format, so it has to be read with the read.table function.
movies = read.table("movieLens.txt", header=FALSE, sep="|",quote="\"")
str(movies)
'data.frame': 1682 obs. of 24 variables:
$ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ V2 : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1524 617 554 593 342 1317 1544 110 390 1239 ...
$ V3 : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
$ V4 : logi NA NA NA NA NA NA ...
$ V5 : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ...
$ V6 : int 0 0 0 0 0 0 0 0 0 0 ...
$ V7 : int 0 1 0 1 0 0 0 0 0 0 ...
$ V8 : int 0 1 0 0 0 0 0 0 0 0 ...
$ V9 : int 1 0 0 0 0 0 0 0 0 0 ...
$ V10: int 1 0 0 0 0 0 0 1 0 0 ...
$ V11: int 1 0 0 1 0 0 0 1 0 0 ...
$ V12: int 0 0 0 0 1 0 0 0 0 0 ...
$ V13: int 0 0 0 0 0 0 0 0 0 0 ...
$ V14: int 0 0 0 1 1 1 1 1 1 1 ...
$ V15: int 0 0 0 0 0 0 0 0 0 0 ...
$ V16: int 0 0 0 0 0 0 0 0 0 0 ...
$ V17: int 0 0 0 0 0 0 0 0 0 0 ...
$ V18: int 0 0 0 0 0 0 0 0 0 0 ...
$ V19: int 0 0 0 0 0 0 0 0 0 0 ...
$ V20: int 0 0 0 0 0 0 0 0 0 0 ...
$ V21: int 0 0 0 0 0 0 1 0 0 0 ...
$ V22: int 0 1 1 0 1 0 0 0 0 0 ...
$ V23: int 0 0 0 0 0 0 0 0 0 1 ...
$ V24: int 0 0 0 0 0 0 0 0 0 0 ...
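A minimal sketch of why the sep and quote arguments matter, using two made-up lines rather than the real movieLens.txt file (the object names `lines` and `toy` are only for illustration). With quote="\"" only the double quote is treated as a quoting character, so a title that starts with an apostrophe is read as ordinary text.

```r
# Two made-up pipe-separated records; the first title contains an apostrophe.
lines <- c("1|'Til There Was You (1997)|0|1",
           "2|Toy Story (1995)|1|0")

# Same arguments as in the post, but reading from text instead of a file.
toy <- read.table(text = lines, header = FALSE, sep = "|", quote = "\"")

# Without a header row, R names the columns V1, V2, ...
str(toy)
```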
Give each column a name.
colnames(movies) = c("ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB", "Unknown", "Action", "Adventure", "Animation", "Childrens", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir", "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller", "War", "Western")
str(movies)
'data.frame': 1682 obs. of 24 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Title : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1524 617 554 593 342 1317 1544 110 390 1239 ...
$ ReleaseDate : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
$ VideoReleaseDate: logi NA NA NA NA NA NA ...
$ IMDB : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ...
$ Unknown : int 0 0 0 0 0 0 0 0 0 0 ...
$ Action : int 0 1 0 1 0 0 0 0 0 0 ...
$ Adventure : int 0 1 0 0 0 0 0 0 0 0 ...
$ Animation : int 1 0 0 0 0 0 0 0 0 0 ...
$ Childrens : int 1 0 0 0 0 0 0 1 0 0 ...
$ Comedy : int 1 0 0 1 0 0 0 1 0 0 ...
$ Crime : int 0 0 0 0 1 0 0 0 0 0 ...
$ Documentary : int 0 0 0 0 0 0 0 0 0 0 ...
$ Drama : int 0 0 0 1 1 1 1 1 1 1 ...
$ Fantasy : int 0 0 0 0 0 0 0 0 0 0 ...
$ FilmNoir : int 0 0 0 0 0 0 0 0 0 0 ...
$ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
$ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
$ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
$ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
$ SciFi : int 0 0 0 0 0 0 1 0 0 0 ...
$ Thriller : int 0 1 1 0 1 0 0 0 0 0 ...
$ War : int 0 0 0 0 0 0 0 0 0 1 ...
$ Western : int 0 0 0 0 0 0 0 0 0 0 ...
Remove the variables that are not needed.
movies$ID = NULL
movies$ReleaseDate = NULL
movies$VideoReleaseDate = NULL
movies$IMDB = NULL
Remove duplicate rows.
movies = unique(movies)
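As a quick illustration on a made-up data frame (the names `toy` and `deduped` are hypothetical), unique drops rows that are identical in every column and keeps the first occurrence of each.

```r
# Rows 1 and 3 are identical in every column.
toy <- data.frame(Title  = c("A", "B", "A"),
                  Action = c(1, 0, 1),
                  Drama  = c(0, 1, 0))

# unique() on a data frame compares whole rows.
deduped <- unique(toy)
nrow(deduped)  # 2
```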
Hierarchical
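First comes hierarchical clustering. At each step it merges the two closest clusters into one, until only a single cluster remains. The distance between two clusters is measured as the distance between their centroids, and the sequence of merges can be drawn as a dendrogram.
The number of clusters is then chosen by hand.
The steps below do this in R.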
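Before the movie data, the merge procedure described above can be sketched on five made-up points on a line. This is only an illustration: it uses method = "centroid" to match the description in the text (the R documentation recommends squared Euclidean distances for that method), whereas the movie code below uses "ward.D". The names `pts`, `d2` and `hc` are hypothetical.

```r
# Five made-up one-dimensional points: two tight pairs and one outlier.
pts <- matrix(c(0, 1, 10, 11, 50), ncol = 1)

# Centroid linkage is meant to be used with squared Euclidean distances.
d2 <- dist(pts)^2
hc <- hclust(d2, method = "centroid")

# One row per merge: negative entries are single points,
# positive entries refer to clusters formed in an earlier row.
hc$merge
```

With n points there are always n - 1 merges, and here the outlier at 50 is the last point to be absorbed.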
Compute distances
distances = dist(movies[2:20], method = "euclidean")
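A small sketch of what the Euclidean distance means for 0/1 genre indicators, on three made-up movies with four made-up genre columns (`genres` and `d` are illustrative names). For binary vectors the distance is the square root of the number of genres on which two movies disagree.

```r
# Three made-up movies described by four 0/1 genre flags.
genres <- rbind(a = c(1, 0, 1, 0),
                b = c(1, 1, 0, 0),
                c = c(1, 0, 1, 0))

d <- dist(genres, method = "euclidean")

# a and b disagree on two genres, so their distance is sqrt(2);
# a and c have identical genres, so their distance is 0.
as.matrix(d)
```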
Hierarchical clustering
clusterMovies = hclust(distances, method = "ward.D")
Plot the dendrogram
plot(clusterMovies)
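The bottom of the dendrogram looks solid black because every observation starts out as its own cluster, so all the leaves are drawn side by side.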
Assign points to clusters
Next, choose the number of clusters.
clusterGroups = cutree(clusterMovies, k = 10)
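To see what cutree returns, here is a sketch on six made-up points forming two well-separated groups (`pts`, `hc` and `groups` are hypothetical names). The result is an integer vector with one cluster label per row of the input, in the original row order.

```r
# Six made-up points: three near the origin, three near (10, 10).
pts <- rbind(c(0, 0), c(0, 1), c(1, 0),
             c(10, 10), c(10, 11), c(11, 10))

hc <- hclust(dist(pts), method = "ward.D")

# Cut the tree into two clusters; labels follow the input row order.
groups <- cutree(hc, k = 2)
groups         # 1 1 1 2 2 2
table(groups)  # three points in each cluster
```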
Inspect each cluster, here through the share of movies in each cluster that belong to a given genre.
tapply(movies$Action, clusterGroups, mean)
Cluster  Mean
1        0.178451178451178
2        0.78391959798995
3        0.123853211009174
4        0
5        0
6        0.1015625
7        0
8        0
9        0
10       0
tapply(movies$Romance, clusterGroups, mean)
Cluster  Mean
1        0.104377104377104
2        0.0452261306532663
3        0.036697247706422
4        0
5        0
6        1
7        1
8        0
9        0
10       0
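Calling tapply once per genre gets repetitive. One possible shortcut, sketched here on a made-up data frame with a made-up cluster assignment (`toy`, `groups` and `means` are illustrative names, not objects from the post), is to compute every genre mean per cluster in a single aggregate call.

```r
# Four made-up movies with three genre flags and a made-up cluster label each.
toy <- data.frame(Action  = c(1, 1, 0, 0),
                  Romance = c(0, 0, 1, 1),
                  Drama   = c(0, 1, 1, 1))
groups <- c(1, 1, 2, 2)

# One row per cluster, one column per genre, each cell a within-cluster mean.
means <- aggregate(toy, by = list(cluster = groups), FUN = mean)
means
```

On the movie data the analogous call would pass the genre columns and clusterGroups, under the assumption that both are still in the workspace.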
Kmeans
Next, K-means.
setwd("E:\\The Analytics Edge\\Unit 6 Clustering")
data = read.csv("dailykos.csv")
Run the algorithm.
set.seed(1000)
KMC = kmeans(data, centers=7)
str(KMC)
List of 9
$ cluster : int [1:3430] 4 4 6 4 1 4 7 4 4 4 ...
$ centers : num [1:7, 1:1545] 0.0342 0.0556 0.0253 0.0136 0.0491 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:7] "1" "2" "3" "4" ...
.. ..$ : chr [1:1545] "abandon" "abc" "ability" "abortion" ...
$ totss : num 896461
$ withinss : num [1:7] 76583 52693 99504 258927 88632 ...
$ tot.withinss: num 730632
$ betweenss : num 165829
$ size : int [1:7] 146 144 277 2063 163 329 308
$ iter : int 7
$ ifault : int 0
- attr(*, "class")= chr "kmeans"
Check the size of each cluster.
table(KMC$cluster)
1 2 3 4 5 6 7
146 144 277 2063 163 329 308
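The fields in the str output are related: the total sum of squares splits into a within-cluster part and a between-cluster part, which matches the numbers above (730632 + 165829 = 896461). A minimal sketch on made-up two-dimensional data (`toy` and `km` are hypothetical names) checks the same identity.

```r
set.seed(1000)

# Made-up data: 20 points around (0, 0) and 20 points around (5, 5).
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

km <- kmeans(toy, centers = 2)

# totss = tot.withinss + betweenss
all.equal(km$totss, km$tot.withinss + km$betweenss)

# km$size holds the same counts as table(km$cluster).
table(km$cluster)
```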
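Supplement
Note that the data should generally be normalized before running a clustering algorithm, to remove the effect of differing orders of magnitude between variables. This can be done as follows.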
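library(caret)
Loading required package: lattice
Loading required package: ggplot2
# preProcess only estimates the centering and scaling parameters
preproc = preProcess(data)
# predict applies them and returns the normalized data
dataNorm = predict(preproc, data)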
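By default preProcess centers and scales each column. To show what that does without needing caret, the sketch below uses base R's scale on a made-up data frame instead; this is a swapped-in stand-in for illustration, and `toy` and `toyNorm` are hypothetical names.

```r
# Two made-up variables on very different scales.
toy <- data.frame(small = c(0.1, 0.2, 0.3, 0.4),
                  large = c(1000, 3000, 2000, 4000))

# Subtract each column's mean and divide by its standard deviation.
toyNorm <- as.data.frame(scale(toy))

colMeans(toyNorm)      # approximately 0 for both columns
apply(toyNorm, 2, sd)  # 1 for both columns
```

After this step both variables contribute to Euclidean distances on a comparable footing, instead of the large-valued column dominating.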
Unless otherwise stated, all posts on this blog are licensed under CC BY-NC-SA 4.0. Please credit Doraemonzzz when reposting.